

# Pairwise sequence alignment with the Smith-Waterman algorithm

Manel Fernández

Intel HPC Software Workshop Series 2016

HPC Code Modernization for Intel® Xeon and Xeon Phi™

February 18<sup>th</sup> 2016, Barcelona



# Pairwise sequence alignment



Comparison applied to *pairs* of biomolecular sequences

RNA, DNA, or amino-acids

The goal is to identify *how similar* they are (homology)



Also used on non-biological sequences, such as natural language and financial data



### Pairwise comparison of Naegleria NgTet1 vs E. coli AlkB (a-c) and human ABH2 (d-f)

From "Structure of a Naegleria Tet-like dioxygenase in complex with 5-methylcytosine DNA", by H. Hashimoto Et Al., Nature 506, 391–395 (20 February 2014)





# Pairwise sequence alignment methods



### Global alignment

• Covers the entire sequences

 Needleman-Wunsch algorithm finds the best global alignment

### Local alignment

Only covers parts of the sequences

 Smith-Waterman (S-W) algorithm finds the best local(s) alignment(s)



# Smith-Waterman (SW) algorithm

### Step 1. Compute the *alignment matrix* or *score matrix* ( $\in \mathbb{N}$ )

• Best alignment starts on the [i,j] location with the highest score value in the matrix

$$H[i,0] = H[0,j] = E[0,j] = F[i,0] = 0$$

$$0 \qquad -$$

$$H[i-1,j-1] + \sigma(S_1[i],S_2[j]) \qquad \textit{Match/Mismatch}$$

$$E[i,j] = \max \begin{cases} E[i-1,j] - \omega_e & \textit{Deletion} \\ H[i-1,j] - \omega_o - \omega_e & (S_1[i] \textit{ aligned to a gap}) \end{cases}$$

$$F[i,j] = \max \begin{cases} F[i,j-1] - \omega_e & \textit{Insertion} \\ H[i,j-1] - \omega_o - \omega_e & (S_2[j] \textit{ aligned to a gap}) \end{cases}$$

 $0 \le i \le m, 0 \le j \le n$ 

 $1 \le i \le m, 1 \le j \le n$ 

#### Where

- $S_1[1..m]$ ,  $S_2[1..n]$  are sequences over alphabet  $\Sigma$
- $\sigma(a,b)$  is the similarity function on alphabet letters (substitution matrix)
- $\omega_o$  and  $\omega_e$  are gap opening and gap extension penalty scheme

### Step 2. Find the *optimum local alignment* in the matrix

- From the highest value, go backwards on the direction of used to construct the matrix
- Alignment sequence ends when a matrix cell with zero value is reached





# Smith-Waterman algorithm: example

$$S_1=$$
 AGCACACA,  $S_2=$  ACACACTA  $\omega_o=-1$ ,  $\omega_e=0$ ,  $\sigma(a,b)=egin{cases} +2,a=b\ -1,a\neq b \end{cases}$ 

Step 1

$$H = \begin{pmatrix} - & A & C & A & C & A & C & T & A \\ - & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ A & 0 & 2 & 1 & 2 & 1 & 2 & 1 & 0 & 2 \\ G & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 1 \\ C & 0 & 0 & 3 & 2 & 3 & 2 & 3 & 2 & 1 \\ A & 0 & 2 & 2 & 5 & 4 & 5 & 4 & 3 & 4 \\ C & 0 & 1 & 4 & 4 & 7 & 6 & 7 & 6 & 5 \\ A & 0 & 2 & 3 & 6 & 6 & 9 & 8 & 7 & 8 \\ C & 0 & 1 & 4 & 5 & 8 & 8 & 11 & 10 & 9 \\ A & 0 & 2 & 3 & 6 & 7 & 10 & 10 & 10 & 12 \end{pmatrix}$$

Step 2

$$S_1 = AGCACAC-A$$
  
 $S_2 = A-CACACTA$ 



# SW score matrix example



### Simple example

- Slightly different sequences
- Similar length (~500 symbols)

### Colour gradient for matrix values

- Blue for H[i,j] = 0
- Red for high score values

### High homology

- About 352 score for this example
- Not necessarily always the case



# Other algorithm considerations

### Only address Step 1 of the algorithm (\*)

- Build the score matrix ("recurrence") and find the highest score ("reduce")
- We will briefly discuss Step 2 later on this presentation

### Algorithm complexity

- Computational complexity: O(mn) (quadratic)
  - "Recurrence" and "reduce" steps can be fused
- Space complexity: <del>O(mn)</del> (quadratic) ?? O(m+n) (linear)
  - No need to store the whole matrix to find the highest score value



(\*) Code modernization proposed by "Pairwise DNA sequence alignment optimization" by Yongchao Liu and Bertil Schmidt ("High performance parallelism pearls", Vol. 2, Ch. 4)



# Code modernization of S-W algorithm

How to parallelize the score matrix computation?

- Dependency when parallelizing by rows or columns
- Parallelize over NE-SW diagonals!

#### Assuming

- $S_1$  is placed vertically, indexed by i, and length $(S_1) = m$
- $S_2$  is placed horizontally, indexed by j, and length( $S_2$ ) = n
- *m* ≤ *n*

#### We can see that

- Diagonal d = i + j, where  $0 \le d < m + n 1$
- Starting row index  $srow_d = max(0, d n + 1)$
- Ending row index  $erow_d = \min(d, m-1)$





# Intra-process (SMP) parallelization





### Scale-and-vectorize approach

- Partition the matrix into tiles
  - Tile size is  $\alpha$  rows  $\times \beta$  columns
  - Matrix size is  $R = \left[\frac{m}{\alpha}\right]$  rows  $\times C = \left[\frac{n}{\beta}\right]$  columns
- TLP on tiles for every external diagonal
- DLP on every internal diagonal

### Current implementation parameters

• 
$$C = \text{#threads}$$
,  $\beta = \left[\frac{n}{\text{#threads}}\right]$ 

- $\alpha = SIMD_{length} \times K$ 
  - $SIMD_{length} = 8$  for AVX2, 16 for AVX-512/IMCI
  - Small K (i.e., K=1) maximize external diagonals with #threads tiles, since  $\alpha \ll \beta$  and  $R \gg C$



# S-W in action

Simple example animation

- 8 threads (column tiles)
- Taller tiles than actual ones





# Thread parallelism with OpenMP







Shared by all threads



# OpenMP implementation

```
#pragma omp parallel firstprivate(GV,GD) default(shared)
for (d = 0; d < R+C-1; d++)
    sExtRow = max(0, d-C+1);
    eExtRow = min(d, R-1);
    Load the per-thread maximum score lmaxScore;
    #pragma omp barrier
    #pragma omp for schedule(static,1) nowait
    for (r = sExtRow; r <= eExtRow; r++)</pre>
        Compute all cells in tile (r,d-r);
    Save ImaxScore to a global variable;
    Swap the input and output of GV and GD;
```

Threads need to be synchronized before moving to the next external diagonal.

For every diagonal, every thread will process one tile.

No need to synchronize at the end since updates are not shared.

Place together threads with consecutive IDs to improve locality.

S-W within a tile

Reduce to get the optimal local alignment score;



# Auto-vectorization with SIMD

```
for (int row = 0; row < alpha; row++)
    for (int col = 0; col < beta; col++)
        H[row][col] = max(0, ...);</pre>
```



loop was not vectorized: vector dependence prevents vectorization

```
for (int d = 0; d < (alpha + beta - 1); d++)
  int row = max(0, d - beta + 1);
  int erow = min(d, alpha - 1);
  int col = min(d, beta - 1);
  int ecol = max(0, d - alpha + 1);
  for (; (row <= erow) && (col >= ecol); row++, col--)
      H[row][col] = max(0, ...);
```



▶ loop was not vectorized: parallel loop condition operator must be one of <, <=, >, >=, or !=



# Auto-vectorization with SIMD (cont'd)

```
// main (assume beta > alpha)
for (int d = alpha - 1; d < beta; d++)
    for (int row = 0, int col = d; row < alpha; row++, col--);
        H[row][col] = max(0, ...);</pre>
```



➤ LOOP WAS VECTORIZED, \*but\*...

Remainder

# Resulting code plenty of gather/scatter constructs

- SIMD elements are not consecutive in memory
- Scatter instruction does not even exists in AVX2 (software scatter)

#### What other choices?

- Guided (Cilk Plus)
- Low-level vectorization (SIMD C++ classes, intrinsics)



Main

# Vectorization with SIMD intrinsics

# Main loop without peel/remainder (IMCI)

Compute column index gCol for current tile;

```
for (d = alpha - 1; d < beta; d++, qCol++)
    // vE[:] = max((GH[qcol].e :: vE[:VL-1]) - qe, (GH[qcol].h :: vH[:VL-1]) - qoe)
    vHup = mm512 alignr epi32(vH, mm512 set1 epi32(GH[gCol].h), 15);
    vE = mm512 \text{ alignr epi32}(vE, mm512 \text{ set1 epi32}(GH[gCol].e), 15);
    vE = mm512 max epi32( mm512 sub epi32(vE, vGapE), mm512 sub epi32(vHup, vGapOE));
    // vF[:] = max(vF[:] - qe, vH[:] - qoe)
    vF = mm512 max epi32 ( mm512 sub epi32 (vF, vGapE), mm512 sub epi32 (vH, vGapOE));
    Compute vScore[:] = \sigma(vS1[:], S2[qCol] :: vS2[0:VL-1]);
    // vH[:] = max(vZero, vD[:] + vScore[:], vE, vF)
    vH = mm512 max epi32 (mm512 max epi32 (vD, vScore), mm512 max epi32 (vE, vF));
    vD = vHup; // Diagonal values for next iteration
    Save vH[VL-1] and vE[VL-1] on GH (new cell on the last row of the stripe);
```

Save vH[0] and vF[0] on GV.out (new cell on the last column of the stripe);



# Inter-process parallelization with MPI



Second level tiling

- Blocks size is  $\alpha_{mpi} imes \beta_{mpi}$
- Matrix size is  $R_{mpi} = \left[\frac{m}{\alpha_{mpi}}\right] \times C_{mpi} = \left[\frac{n}{\beta_{mpi}}\right]$
- $\beta_{mpi} = \left[\frac{n}{P}\right]$ , being P number of MPI ranks

Distributed approach

- Block-level GH, GV, GD distributed buffers
- MPI communication between block diagonals
  - Non-blocking MPI communication primitives
    - MPI\_Isend(), MPI\_Irecv()
- MPI Barrier() for synchronization

Final reduction with MPI\_Reduce()



# Performance evaluation: preliminary questions

Hint: not all answers need to be correct here...

### What's the ideal target? Xeon, Xeon Phi, or offload?

- Problem is highly parallel, Xeon Phi should be better target
- There should be no big difference in using native or offload approach
  - Computational complexity bigger than space complexity

### Compute bound or memory bandwidth bound?

- Kind of "stencil" which tends to be memory bound
- But, proposal does not build a real matrix, so should be compute bound

### What's the expected strong/weak scalability?

- Highly parallel and compute bound might indicate good scalability
- Possible issues for MPI/thread communication/synchronization



# Methodology

|                                  | Y. Liu and B. Schmidt                                                                                                                                                                                                       |           |          | Bayncore       |                                  |  |
|----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|----------|----------------|----------------------------------|--|
| Intel® Xeon® Processor           | E5-2670 (2s, 8c/s, 2ht/c)                                                                                                                                                                                                   |           |          |                | E5-2650 v2 (2s, 8c/s, 2ht/c)     |  |
| Intel® Xeon Phi™ Coprocessor     | 4x 5110P (60c, 4t/c, 1GHz)                                                                                                                                                                                                  |           |          |                | 2x 7120A (61c, 4t/c, 1.2GHz)     |  |
| Operating System                 | Linux                                                                                                                                                                                                                       |           |          |                | RHEL Server 6.6 (Santiago)       |  |
| Intel® C++ Compiler Version      | 15.0 ?                                                                                                                                                                                                                      |           |          |                | 15.0.2.164 (build 20150121)      |  |
| OpenMP Affinity                  | Balanced                                                                                                                                                                                                                    |           |          |                |                                  |  |
| MPI Library                      | OpenMPI                                                                                                                                                                                                                     |           |          | Intel® MPI 5.0 |                                  |  |
| S-W MPI Implementation           | Only MPI + Offload + Xeon Phi SIMD source code provided MPI ranks and MPI communication on the processor Every rank offloads to a coprocessor (1-rank per coprocessor) $\alpha = SIMD_{length}, \qquad \alpha_{mpi} = 128K$ |           |          |                |                                  |  |
| S-W Input Set (Seq1 vs Seq2) (*) | Seq.                                                                                                                                                                                                                        | Length    | Accessio | n No.          | Genome definition                |  |
|                                  | S4.4                                                                                                                                                                                                                        | 4,411,532 | NC_000   | 962.3          | Mycobacterium tuberculosis H37Rv |  |
|                                  | S4.6                                                                                                                                                                                                                        | 4,641,652 | NC_000   | 913.3          | Escherichia coli K12 MG1655      |  |

<sup>(\*)</sup> More longer sequences in Y. Liu and B. Schmidt's performance evaluation



# Intra-coprocessor scalability







# Inter-coprocessor scalability

#### Smith-Waterman, s4.4 vs s4.6





# Inter-coprocessor scalability (cont'd)





# Compute bound or bandwidth bound?



Estimated Latency Impact value is high, which likely indicates that the majority of L1 data cache misses are not being serviced by the L2 cache. Software prefetching is one strategy for improving this on the Intel Xeon Phi coprocessor. Data reorganization or traditional techniques to increase data locality (such as...



# S-W Step 2: Finding optimum local alignment

Traceback procedure leading from highest score value

It would require to store the complete score matrix, which is prohibitive!

Proposal: store partial information of score matrix

- Then recompute only the tiles following the alignment path
- Space needed from m x n down to C x R





# Code modernization: not a single solution

A whole family of parallel programming models







### Cache oblivious algorithm

• Cilk Plus, TBB, OpenMP teams with nested parallelism, etc.

```
void diamond sw (matrix d)
    if (d is small)
        small sw(d);
    else
        divide d into d0, d1, d2, d3;
        diamond sw(d0);
        cilk spawn diamond sw(d1);
        /*nospawn*/diamond sw(d2);
        cilk sync;
        diamond sw(d3)
```



# "Future" improvements

### Intrinsics are rarely a long-term option

- Compiler *must* vectorize the diagonal-based loop in source code
  - Demand the compiler to vectorize the source code loop (w/ or w/o pragmas)
  - Properly generating "alignr" with the expression: { c[1:n] = c[0:n-1]; c[0] = x; }
- File compiler defects at Intel Premier

### Observation: values in score matrix are:

- Relatively "small" natural numbers (n <  $2^k$ ,  $n \in \mathbb{N} \land k = \{8,16\}$ )
- Stored in 32-bit cells (i.e., 32-bit vector elements)
- Idea: we want longer vectors!

### AVX-512 – Byte and Word Instruction (BWI) extension

- 512-bit vectors also supporting 16- and 8-bit datatypes (32 and 64 elements!)
  - DNA algorithms will shine on future AVX-512 BWI extension
- Currently supported by Intel compilers, but not by Intel hardware yet



# Summary

### Intel<sup>®</sup> Xeon Phi<sup>TM</sup> speedup

- vs Intel<sup>®</sup> Xeon Phi<sup>TM</sup> w/o SIMD: 15.4x
- vs Intel® Xeon® w/o SIMD: 12.1x
- vs "best" Intel® Xeon® w/ SIMD: ?

### Benchmark ends to be compute bound

- Not bad intra-process scalability: 60%-70% efficiency
- Excellent inter-process scalability: 80%-90% efficiency

### Still room for additional improvements

- Little extra performance by using MPI ranks with native Xeon Phi binaries
- Software prefetching to mitigate current latency problems
- It would be desirable to get rid of intrinsics
  - Using auto-vectorization or Intel® Cilk Plus™ array notation
- Significant future improvement targeting AVX-512 BWI extension